R Data Wrangling and Manipulation

UC Berkeley Library and D-Lab

Fall 2017

dplyr/tidyr

It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. (Dasu and Johnson, 2003)

  1. Manipulating data frames, e.g. filtering, summarizing, and conducting calculations across groups.
  2. Tidying data into the appropriate format. Reshaping for analysis
  3. Attempting to describe relationships or conduct causal inference

Gapminder Dataset

Provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007.

gapminder <- read.csv("data/gapminder-FiveYearData.csv", stringsAsFactors = TRUE)
head(gapminder)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134

Dataframe manipulation Base R

We can use base R functions to calculate summary statistics across groups of observaitons:

mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
## [1] 2193.755
mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
## [1] 7136.11
mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
## [1] 7902.15

See how this might get tedious?

Dataframe manipulation dplyr

Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes. These functions will save you time by reducing repetition.

install.packages('dplyr')

Here we’re going to cover 6 of the most commonly used functions as well as using pipes (%>%) to combine them.

  1. select()
  2. filter()
  3. group_by()
  4. summarize()
  5. mutate()
  6. arrange()

Now let’s load the package:

library(dplyr)

dplyr- select()

Imagine that we just received the gapminder dataset, but are only interested in a few variables in it. We could use the select() function to keep only the variables we select.

year_country_gdp <- select(gapminder, year, country, gdpPercap)
head(year_country_gdp)
##   year     country gdpPercap
## 1 1952 Afghanistan  779.4453
## 2 1957 Afghanistan  820.8530
## 3 1962 Afghanistan  853.1007
## 4 1967 Afghanistan  836.1971
## 5 1972 Afghanistan  739.9811
## 6 1977 Afghanistan  786.1134

dplyr- select()

year_country_gdp <- select(gapminder, year, country, gdpPercap)
head(year_country_gdp)
##   year     country gdpPercap
## 1 1952 Afghanistan  779.4453
## 2 1957 Afghanistan  820.8530
## 3 1962 Afghanistan  853.1007
## 4 1967 Afghanistan  836.1971
## 5 1972 Afghanistan  739.9811
## 6 1977 Afghanistan  786.1134

dplyr- select()

dplyr %>%

The pipe operator is one of dplyr’s greatest strengths. It allows you combine several functions in a chain.

year_country_gdp <- gapminder %>% select(year,country,gdpPercap)

is equivalent to:

year_country_gdp <- select(gapminder, year, country, gdpPercap)

dplyr - filter()

Now let’s say we’re only interested in African countries. We can combine select and filter to select only the observations where continent is Africa.

year_country_gdp_euro <- gapminder %>%
    filter(continent == "Africa") %>%
    select(year,country,gdpPercap)

dplyr/Calculations Across Groups

A common task you’ll encounter when working with data is running calculations on different groups within the data. For instance, what if we wanted to calculated the mean GDP per capita for each continent? In base R, you would have to run the mean() function for each subset of data:

mean(gapminder$gdpPercap[gapminder$continent == "Africa"])
## [1] 2193.755
mean(gapminder$gdpPercap[gapminder$continent == "Americas"])
## [1] 7136.11
mean(gapminder$gdpPercap[gapminder$continent == "Asia"])
## [1] 7902.15
mean(gapminder$gdpPercap[gapminder$continent == "Europe"])
## [1] 14469.48
mean(gapminder$gdpPercap[gapminder$continent == "Oceania"])
## [1] 18621.61

dplyr/Calculations Across Groups

gapminder$mean.continent.GDP <- NA

gapminder$mean.continent.GDP[gapminder$continent == "Africa"] <- mean(gapminder$gdpPercap[gapminder$continent == "Africa"])

gapminder$mean.continent.GDP[gapminder$continent == "Americas"] <- mean(gapminder$gdpPercap[gapminder$continent == "Americas"])

gapminder$mean.continent.GDP[gapminder$continent == "Asia"] <- mean(gapminder$gdpPercap[gapminder$continent == "Asia"])

gapminder$mean.continent.GDP[gapminder$continent == "Europe"] <- mean(gapminder$gdpPercap[gapminder$continent == "Europe"])

gapminder$mean.continent.GDP[gapminder$continent == "Oceania"] <- mean(gapminder$gdpPercap[gapminder$continent == "Oceania"])

split-apply-combine

dplyr- group_by()

dplyr - summarize()

gdp_bycontinents <- gapminder %>%
    group_by(continent) %>%
    summarize(mean_gdpPercap = mean(gdpPercap))
head(gdp_bycontinents)
## # A tibble: 5 x 2
##   continent mean_gdpPercap
##      <fctr>          <dbl>
## 1    Africa       2193.755
## 2  Americas       7136.110
## 3      Asia       7902.150
## 4    Europe      14469.476
## 5   Oceania      18621.609

dplyr - summarize()

dplyr - summarize()

group_by() multiple variables

gdp_bycontinents_byyear <- gapminder %>%
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap))
head(gdp_bycontinents_byyear)
## # A tibble: 6 x 3
## # Groups:   continent [1]
##   continent  year mean_gdpPercap
##      <fctr> <int>          <dbl>
## 1    Africa  1952       1252.572
## 2    Africa  1957       1385.236
## 3    Africa  1962       1598.079
## 4    Africa  1967       2050.364
## 5    Africa  1972       2339.616
## 6    Africa  1977       2585.939

dplyr - summarize()

group_by() multiple variables and defining multiple variable with summarize()

gdp_pop_bycontinents_byyear <- gapminder %>%
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop))
head(gdp_pop_bycontinents_byyear)
## # A tibble: 6 x 6
## # Groups:   continent [1]
##   continent  year mean_gdpPercap sd_gdpPercap mean_pop   sd_pop
##      <fctr> <int>          <dbl>        <dbl>    <dbl>    <dbl>
## 1    Africa  1952       1252.572     982.9521  4570010  6317450
## 2    Africa  1957       1385.236    1134.5089  5093033  7076042
## 3    Africa  1962       1598.079    1461.8392  5702247  7957545
## 4    Africa  1967       2050.364    2847.7176  6447875  8985505
## 5    Africa  1972       2339.616    3286.8539  7305376 10130833
## 6    Africa  1977       2585.939    4142.3987  8328097 11585184

dplyr- mutate()

What if we wanted to add these values to our original data frame instead of creating a new object?

gapminder_with_extra_vars <- gapminder %>%
    group_by(continent, year) %>%
    mutate(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop))
head(gapminder_with_extra_vars)
## # A tibble: 6 x 11
## # Groups:   continent, year [6]
##       country  year      pop continent lifeExp gdpPercap
##        <fctr> <int>    <dbl>    <fctr>   <dbl>     <dbl>
## 1 Afghanistan  1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan  1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan  1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan  1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan  1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan  1977 14880372      Asia  38.438  786.1134
## # ... with 5 more variables: mean.continent.GDP <dbl>,
## #   mean_gdpPercap <dbl>, sd_gdpPercap <dbl>, mean_pop <dbl>, sd_pop <dbl>

dplyr- mutate()

We can use also use mutate() to create new variables prior to (or even after) summarizing information.

gdp_pop_bycontinents_byyear <- gapminder %>%
    mutate(gdp_billion = gdpPercap*pop/10^9) %>%
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop),
              mean_gdp_billion = mean(gdp_billion),
              sd_gdp_billion = sd(gdp_billion))
head(gdp_pop_bycontinents_byyear)
## # A tibble: 6 x 8
## # Groups:   continent [1]
##   continent  year mean_gdpPercap sd_gdpPercap mean_pop   sd_pop
##      <fctr> <int>          <dbl>        <dbl>    <dbl>    <dbl>
## 1    Africa  1952       1252.572     982.9521  4570010  6317450
## 2    Africa  1957       1385.236    1134.5089  5093033  7076042
## 3    Africa  1962       1598.079    1461.8392  5702247  7957545
## 4    Africa  1967       2050.364    2847.7176  6447875  8985505
## 5    Africa  1972       2339.616    3286.8539  7305376 10130833
## 6    Africa  1977       2585.939    4142.3987  8328097 11585184
## # ... with 2 more variables: mean_gdp_billion <dbl>, sd_gdp_billion <dbl>

dplyr - arrange()

gapminder_with_extra_vars <- gapminder %>%
    group_by(continent, year) %>%
    mutate(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop)) %>%
    arrange(desc(year), continent)
head(gapminder_with_extra_vars)
## # A tibble: 6 x 11
## # Groups:   continent, year [1]
##        country  year      pop continent lifeExp  gdpPercap
##         <fctr> <int>    <dbl>    <fctr>   <dbl>      <dbl>
## 1      Algeria  2007 33333216    Africa  72.301  6223.3675
## 2       Angola  2007 12420476    Africa  42.731  4797.2313
## 3        Benin  2007  8078314    Africa  56.728  1441.2849
## 4     Botswana  2007  1639131    Africa  50.728 12569.8518
## 5 Burkina Faso  2007 14326203    Africa  52.295  1217.0330
## 6      Burundi  2007  8390505    Africa  49.580   430.0707
## # ... with 5 more variables: mean.continent.GDP <dbl>,
## #   mean_gdpPercap <dbl>, sd_gdpPercap <dbl>, mean_pop <dbl>, sd_pop <dbl>

Your turn

  1. Use dplyr to create a data frame containing the median lifeExp for each continent

  2. Use dplyr to add a column to the gapminder dataset that contains the total population of the continent of each observation in a given year. For example, if the first observation is Afghanistan in 1952, the new column would contain the population of Asia in 1952.

  3. Use dplyr to: (a) add a column called gdpPercap_diff that contains the difference between the observation’s gdpPercap and the mean gdpPercap of the continent in that year, (b) arrange the dataframe by the column you just created, in descending order (so that the relatively richest country/years are listed first)

Tidying Data

Even before we conduct analysis or calculations, we need to put our data into the correct format. The goal here is to rearrange a messy dataset into one that is tidy

The two most important properties of tidy data are:

  1. Each column is a variable.
  2. Each row is an observation.

Tidying Data

“Happy families are all alike; every unhappy family is unhappy in its own way.” - Leo Tolstoy

“Tidy datasets are all alike but every messy dataset is messy in its own way.” – Hadley Wickham

Tidying Data

wide
##      name time1 time2 time3
## 1  Wilbur    67    56    70
## 2 Petunia    80    90    67
## 3 Gregory    64    50   101
long
##      name time heartrate
## 1  Wilbur    1        67
## 2 Petunia    1        80
## 3 Gregory    1        64
## 4  Wilbur    2        56
## 5 Petunia    2        90
## 6 Gregory    2        50
## 7  Wilbur    3        70
## 8 Petunia    3        67
## 9 Gregory    3        10

Which one of these do you think is the tidy format?